The Center for Policing Equity (CPE) is a team of researchers, experts in race and equity, data analysts, and community trainers who work together to create fair and just systems. Their approach involves using data and science as tools and collaborating with both law enforcement and communities to bridge communication gaps and mistrust, and promote public safety, community trust, and racial equity.
Police departments from all over the United States are joining the National Justice Database, which is the largest standardized collection of police behavioral data. In exchange for access to their records, which include use of force incidents, vehicle and pedestrian stops, calls for service, and crime data, the CPE’s scientists use advanced analytics to identify disparities in policing, shed light on police behavior, and provide actionable recommendations. Their custom reports are highly-detailed and provide police departments with guidance on improving public safety, restoring trust, and aligning their work with their own values.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(leaflet)
library(viridisLite)
library(dplyr)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(lubridate)
## Loading required package: timechange
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(knitr)
This Kaggle dataset contains information from the National Justice Database created by the Center for Policing Equity (CPE). It includes over 50 million records of police interactions with the public, including vehicle and pedestrian stops, use of force incidents, and calls for service, from over 30 police departments across the United States. The dataset is designed to help data scientists and researchers identify patterns of racial and ethnic disparities in policing and to support efforts to improve police-community relations.
#To read the csv file
df <- read.csv("37-00049_UOF-P_2016_prepped.csv")
The dataset is loaded and duplicates from dataset is deleted as the top rows first and second contain column names. So reducing the dimensionality of the dataset. The dataset contains 2384 rows with 47 columns. Summary() of the column is to get the summary of the variables used in the dataframe.
# Keep only unique rows in the data frame 'df' and store the result in 'df_unique'
df_unique <- unique(df)
# Print the dimensions (number of rows and columns) of 'df_unique'
print(dim(df_unique))
## [1] 2384 47
# Remove the first row of the data frame
df_unique <- df_unique[-1,]
# Print a summary of the data frame
summary(df_unique)
## INCIDENT_DATE INCIDENT_TIME UOF_NUMBER OFFICER_ID
## Length:2383 Length:2383 Length:2383 Length:2383
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## OFFICER_GENDER OFFICER_RACE OFFICER_HIRE_DATE
## Length:2383 Length:2383 Length:2383
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## OFFICER_YEARS_ON_FORCE OFFICER_INJURY OFFICER_INJURY_TYPE
## Length:2383 Length:2383 Length:2383
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## OFFICER_HOSPITALIZATION SUBJECT_ID SUBJECT_RACE
## Length:2383 Length:2383 Length:2383
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## SUBJECT_GENDER SUBJECT_INJURY SUBJECT_INJURY_TYPE SUBJECT_WAS_ARRESTED
## Length:2383 Length:2383 Length:2383 Length:2383
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## SUBJECT_DESCRIPTION SUBJECT_OFFENSE REPORTING_AREA BEAT
## Length:2383 Length:2383 Length:2383 Length:2383
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## SECTOR DIVISION LOCATION_DISTRICT STREET_NUMBER
## Length:2383 Length:2383 Length:2383 Length:2383
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## STREET_NAME STREET_DIRECTION STREET_TYPE
## Length:2383 Length:2383 Length:2383
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## LOCATION_FULL_STREET_ADDRESS_OR_INTERSECTION LOCATION_CITY
## Length:2383 Length:2383
## Class :character Class :character
## Mode :character Mode :character
## LOCATION_STATE LOCATION_LATITUDE LOCATION_LONGITUDE INCIDENT_REASON
## Length:2383 Length:2383 Length:2383 Length:2383
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## REASON_FOR_FORCE TYPE_OF_FORCE_USED1 TYPE_OF_FORCE_USED2 TYPE_OF_FORCE_USED3
## Length:2383 Length:2383 Length:2383 Length:2383
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## TYPE_OF_FORCE_USED4 TYPE_OF_FORCE_USED5 TYPE_OF_FORCE_USED6
## Length:2383 Length:2383 Length:2383
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## TYPE_OF_FORCE_USED7 TYPE_OF_FORCE_USED8 TYPE_OF_FORCE_USED9
## Length:2383 Length:2383 Length:2383
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## TYPE_OF_FORCE_USED10 NUMBER_EC_CYCLES FORCE_EFFECTIVE
## Length:2383 Length:2383 Length:2383
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
# Replace the original data frame 'df' with the cleaned-up version 'df_unique'
df <- df_unique
str(df)
## 'data.frame': 2383 obs. of 47 variables:
## $ INCIDENT_DATE : chr "9/3/16" "3/22/16" "5/22/16" "1/10/16" ...
## $ INCIDENT_TIME : chr "4:14:00 AM" "11:00:00 PM" "1:29:00 PM" "8:55:00 PM" ...
## $ UOF_NUMBER : chr "37702" "33413" "34567" "31460" ...
## $ OFFICER_ID : chr "10810" "7706" "11014" "6692" ...
## $ OFFICER_GENDER : chr "Male" "Male" "Male" "Male" ...
## $ OFFICER_RACE : chr "Black" "White" "Black" "Black" ...
## $ OFFICER_HIRE_DATE : chr "5/7/14" "1/8/99" "5/20/15" "7/29/91" ...
## $ OFFICER_YEARS_ON_FORCE : chr "2" "17" "1" "24" ...
## $ OFFICER_INJURY : chr "No" "Yes" "No" "No" ...
## $ OFFICER_INJURY_TYPE : chr "No injuries noted or visible" "Sprain/Strain" "No injuries noted or visible" "No injuries noted or visible" ...
## $ OFFICER_HOSPITALIZATION : chr "No" "Yes" "No" "No" ...
## $ SUBJECT_ID : chr "46424" "44324" "45126" "43150" ...
## $ SUBJECT_RACE : chr "Black" "Hispanic" "Hispanic" "Hispanic" ...
## $ SUBJECT_GENDER : chr "Female" "Male" "Male" "Male" ...
## $ SUBJECT_INJURY : chr "Yes" "No" "No" "Yes" ...
## $ SUBJECT_INJURY_TYPE : chr "Non-Visible Injury/Pain" "No injuries noted or visible" "No injuries noted or visible" "Laceration/Cut" ...
## $ SUBJECT_WAS_ARRESTED : chr "Yes" "Yes" "Yes" "Yes" ...
## $ SUBJECT_DESCRIPTION : chr "Mentally unstable" "Mentally unstable" "Unknown" "FD-Unknown if Armed" ...
## $ SUBJECT_OFFENSE : chr "APOWW" "APOWW" "APOWW" "Evading Arrest" ...
## $ REPORTING_AREA : chr "2062" "1197" "4153" "4523" ...
## $ BEAT : chr "134" "237" "432" "641" ...
## $ SECTOR : chr "130" "230" "430" "640" ...
## $ DIVISION : chr "CENTRAL" "NORTHEAST" "SOUTHWEST" "NORTH CENTRAL" ...
## $ LOCATION_DISTRICT : chr "D14" "D9" "D6" "D11" ...
## $ STREET_NUMBER : chr "211" "7647" "716" "5600" ...
## $ STREET_NAME : chr "Ervay" "Ferguson" "bimebella dr" "LBJ" ...
## $ STREET_DIRECTION : chr "N" "NULL" "NULL" "NULL" ...
## $ STREET_TYPE : chr "St." "Rd." "Ln." "Frwy." ...
## $ LOCATION_FULL_STREET_ADDRESS_OR_INTERSECTION: chr "211 N ERVAY ST" "7647 FERGUSON RD" "716 BIMEBELLA LN" "5600 L B J FWY" ...
## $ LOCATION_CITY : chr "Dallas" "Dallas" "Dallas" "Dallas" ...
## $ LOCATION_STATE : chr "TX" "TX" "TX" "TX" ...
## $ LOCATION_LATITUDE : chr "32.782205" "32.798978" "32.73971" "" ...
## $ LOCATION_LONGITUDE : chr "-96.797461" "-96.717493" "-96.92519" "" ...
## $ INCIDENT_REASON : chr "Arrest" "Arrest" "Arrest" "Arrest" ...
## $ REASON_FOR_FORCE : chr "Arrest" "Arrest" "Arrest" "Arrest" ...
## $ TYPE_OF_FORCE_USED1 : chr "Hand/Arm/Elbow Strike" "Joint Locks" "Take Down - Group" "K-9 Deployment" ...
## $ TYPE_OF_FORCE_USED2 : chr "" "" "" "" ...
## $ TYPE_OF_FORCE_USED3 : chr "" "" "" "" ...
## $ TYPE_OF_FORCE_USED4 : chr "" "" "" "" ...
## $ TYPE_OF_FORCE_USED5 : chr "" "" "" "" ...
## $ TYPE_OF_FORCE_USED6 : chr "" "" "" "" ...
## $ TYPE_OF_FORCE_USED7 : chr "" "" "" "" ...
## $ TYPE_OF_FORCE_USED8 : chr "" "" "" "" ...
## $ TYPE_OF_FORCE_USED9 : chr "" "" "" "" ...
## $ TYPE_OF_FORCE_USED10 : chr "" "" "" "" ...
## $ NUMBER_EC_CYCLES : chr "NULL" "NULL" "NULL" "NULL" ...
## $ FORCE_EFFECTIVE : chr " Yes" " Yes" " Yes" " Yes" ...
A frequency table was created to analyze the subject race and their arrest status. The table revealed that a significant number of Black subjects were arrested, followed by Hispanic and White subjects. However, it was also observed that the Black and Hispanic populations were larger than the White population, which may have contributed to the higher number of arrests.
#TABLE
df1<-table(subset(df, SUBJECT_RACE != "NULL")$SUBJECT_RACE, subset(df, SUBJECT_RACE != "NULL")$SUBJECT_WAS_ARRESTED)
The ggplot() function creates a bar plot with subject race on the x-axis and the fill color representing each race. The geombar() function is used to create the bars for each race, with the height of each bar representing the frequency of arrests.
The function ggplotly() then converts the object into an interactive plot allowing the user can hover over each bar to see the exact count of arrests for each race. The interactive plot also allows the user to zoom in and out, pan the plot, and download the plot as a PNG image.
Overall, this plot provides that among the subject Black race was the highest followed by Hispanic , which can help identify any potential racial disparities in policing.
#Barchart
gg <- ggplot(df, aes(x = SUBJECT_RACE, fill = SUBJECT_RACE)) +
geom_bar() +
labs(title = "Frequency of Arrests by Subject Race",
x = "Subject Race",
y = "Frequency")
ggplotly(gg)
Among the officer population , the White race among the highest followed by Hispanic and Black. This could lead to assumption there might be racial bias against particular races among the subject.
gg <- ggplot(df, aes(x = OFFICER_RACE, fill = OFFICER_RACE)) +
geom_bar() +
labs(title = "Frequency of Arrests by Officer Race",
x = "Officer Race",
y = "Frequency")
ggplotly(gg)
To investigate the possibility of racial bias, we examined the frequency of arrests across different races. Upon analysis, we observed that the count of arrests among White, Black, and Hispanic subjects was notably higher compared to other racial groups.
#STACKED BAR CHART
gg <- ggplot(df, aes(x = SUBJECT_RACE, fill = SUBJECT_WAS_ARRESTED)) +
geom_bar() +
labs(title = "Frequency of Arrests by Subject Race and Arrest Status",
x = "Subject Race",
y = "Frequency",
fill = "Arrest Status")
ggplotly(gg)
df_pie <- data.frame(table(df$OFFICER_HOSPITALIZATION))
gg <- ggplot(df_pie, aes(x = "", y = Freq, fill = Var1)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar(theta = "y") +
scale_fill_manual(values = c("#FF5733", "#FFC300")) +
labs(title = "Officer Hospitalization",
subtitle = "Distribution of Officer Hospitalization in the Dataset",
fill = "Hospitalization",
x = NULL,
y = NULL) +
theme_void() +
theme(legend.position = "bottom",
panel.border = element_rect(color = "black", fill = NA, size = 1),
plot.title = element_text(size = 20, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 14, hjust = 0.5))
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
gg
To further investigate the nature of incidents between the subjects and officers, we analyzed the level of violence exhibited by the subjects towards the officers. Our analysis revealed that the subjects were relatively less violent towards the officers, which may explain the lower rate of officer hospitalization.
df$INCIDENT_DATE <- as.Date(df$INCIDENT_DATE, format = "%m/%d/%Y")
# Replace "00" with "20" in the incident date
df$INCIDENT_DATE <- gsub("00","20",df$INCIDENT_DATE)
# Reformat the date column to the desired format
df$INCIDENT_DATE <- as.Date(df$INCIDENT_DATE, format = "%Y-%m-%d")
# Convert the time column to a 24-hour format
df$INCIDENT_TIME <- format(strptime(df$INCIDENT_TIME, "%I:%M:%S %p"), "%H:%M:%S")
# Extract the month name from the incident date
df$INCIDENT_MONTH <- months(as.Date(df$INCIDENT_DATE))
# Extract the month number from the incident date
df$INC_MONTH <- format(df$INCIDENT_DATE,"%m")
# Extract the hour from the incident time
df$INCIDENT_HOUR <- as.numeric(substr(df$INCIDENT_TIME, 0, 2))
# Calculate the day of the week from the incident date and store it as a label
df$INCIDENT_DAY <- wday(df$INCIDENT_DATE, label=TRUE)
# Extract the hour from the incident time as a string
df$INC_HOUR <- substr(df$INCIDENT_TIME, 0, 2)
# Extract the day from the incident date as a string
df$INC_DATE <- substr(df$INCIDENT_DATE, 9, 10)
df$INCIDENT_MONTH <- factor(df$INCIDENT_MONTH, levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))
ggplot(df, aes(x = INCIDENT_MONTH)) +
stat_count(binwidth = 1, fill = "#FF4500", color = "white") +
scale_fill_manual(values=c("#4B0082", "#9400D3", "#00BFFF", "#1E90FF", "#00CED1", "#32CD32", "#FFD700", "#FFA500", "#FF4500", "#B22222", "#8B0000", "#FF69B4")) +
labs(x = "Month", y = "Number of crimes incidents", title = "Distribution of incidents in a month") +
theme(panel.border = element_rect(colour = "black", fill = NA, size = 1))
## Warning in stat_count(binwidth = 1, fill = "#FF4500", color = "white"): Ignoring
## unknown parameters: `binwidth`
After extracting the incident date, time, month, and hour from the dataset, we conducted an analysis to determine which month had the highest rate of crimes. Our analysis revealed that the crime rate was particularly high in January and decreased during the winter break.
This finding suggests that there may be a seasonal pattern to crime, with higher rates during certain months of the year.
#Density plot
ggplot(df, aes(INCIDENT_HOUR)) +
geom_density(fill = "red", alpha = 0.5) +
labs(x = "Incident Hour", y = "Density", title = "Density Plot of Incident Hours") +
scale_x_continuous(breaks = 0:23) +
theme_minimal()
## Warning: Removed 10 rows containing non-finite values (`stat_density()`).
After identifying the month with the highest crime rate, we conducted a further analysis to determine which hours of the day crimes were most likely to occur. Not surprisingly, our analysis revealed that a large number of crimes were committed during late hours, with a decrease in crime rates during the morning busy hours.
This finding highlights the importance of considering both the month and time of day when analyzing crime patterns.
#Cluster maps: Cluster maps are useful for visualizing groups of events that are close to each other.
df_new <- df %>%filter(LOCATION_LATITUDE != "Latitude", LOCATION_LONGITUDE != "Longitude")
map <- leaflet(data = df_new) %>%
addTiles() %>%
addMarkers(
lat = ~as.numeric(LOCATION_LATITUDE),
lng = ~as.numeric(LOCATION_LONGITUDE),
clusterOptions = markerClusterOptions(),
popup = ~paste(
"<strong>Incident Reason:</strong>", INCIDENT_REASON, "<br>",
"<strong>Subject Race:</strong>", SUBJECT_RACE, "<br>",
"<strong>Officer Race:</strong>", OFFICER_RACE, "<br>",
"<strong>Was Arrested:</strong>", SUBJECT_WAS_ARRESTED
)
)
## Warning in validateCoords(lng, lat, funcName): Data contains 55 rows with either
## missing or invalid lat/lon values and will be ignored
map
Upon determining the specific time and day when the crime occurred, we proceed to plot its location using latitude and longitude data. This allows us to narrow down our investigation to the exact street number, street name, street direction, and street type where the incident took place. By doing so, we aim to gain a deeper understanding of the underlying reasons for the crime, including the race of the perpetrator and any involved law enforcement officers. Additionally, we seek to ascertain whether the subject was apprehended or if they managed to escape.
df_filtered <- df %>% filter(SUBJECT_RACE !="NULL",SUBJECT_RACE !="NULL")
summary_table <- df_filtered %>%
filter(!is.na(SUBJECT_RACE) & !is.na(SUBJECT_GENDER)) %>%
group_by(SUBJECT_RACE, SUBJECT_GENDER) %>%
summarize(count = n())
## `summarise()` has grouped output by 'SUBJECT_RACE'. You can override using the
## `.groups` argument.
# Create the interactive plot to display count of incidents
plot_ly(data = summary_table,
x = ~SUBJECT_RACE,
y = ~count,
type = 'bar',
color = ~SUBJECT_GENDER,
text = ~count,
textposition = 'auto') %>%
layout(title = "Use of Force Incidents by Subject Race and Gender",
xaxis = list(title = "Subject Race"),
yaxis = list(title = "Incident Count"))
Futher investigation also focuses on understanding the frequency of incidents that were recorded. Our analysis reveals that Black males accounted for a higher number of incidents compared to other genders, and that incidents involving Black females were less violent in nature. Additionally, incidents involving White individuals had a relatively higher count than other races.
summary_table <- df %>%
group_by(OFFICER_YEARS_ON_FORCE) %>%
summarize(count = n())
# Create the interactive plot
plot_ly(data = summary_table,
x = ~OFFICER_YEARS_ON_FORCE,
y = ~count,
type = 'scatter',
mode = 'markers',
marker = list(size = 10)) %>%
layout(title = "Officer Years on Force vs. Number of Use of Force Incidents",
xaxis = list(title = "Years on Force"),
yaxis = list(title = "Incident Count"))
An interesting trend that has emerged from our analysis is that as the number of years of service for law enforcement officers increases, the frequency of force incidents used during their encounters with subjects decreases. On the other hand, younger officers tend to use a higher number of forces when dealing with subjects.
gender <- df$OFFICER_GENDER
experience <- as.numeric(df$OFFICER_YEARS_ON_FORCE)
df1 <- data.frame(gender, experience)
# Create the sinaplot
ggplot(df1, aes(x = gender, y = experience, fill = gender)) +
geom_violin(scale = "width", trim = FALSE, show.legend = FALSE) +
geom_point(aes(color = gender), position = position_jitterdodge(), size = 0.7) +
scale_fill_manual(values = c("orange", "blue")) +
scale_color_manual(values = c("black", "black")) +
labs(x = "Gender", y = "Years on Force") +
theme_classic() +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "none"
)
Combination of a violin plot and a jitter plot, which is used to visualize the distribution of experience (Years on Force) for different genders. This plot helps understand the distribution of experience among different genders and provides insights into potential trends, differences, or similarities in the data.Its noticed that male tend to stay in force for longer than the female.
density_plot <- ggplot(data = df) +
aes(x = OFFICER_YEARS_ON_FORCE, fill = OFFICER_RACE) +
geom_density(alpha = 0.5) +
theme_minimal() +
labs(title = "Density Plot of Officer Years on Force",
x = "Years on Force",
y = "Density")
interactive_density_plot <- ggplotly(density_plot)
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
interactive_density_plot